We would like to understand who is most likely to be the champion of speed dating, and which key drivers affect people’s decision when selecting a partner.
We used data from a speed dating experiment conducted at Columbia Business School, available on Kaggle. This is how the first 5 of the total of 1346 rows look:
| | 01 | 02 | 03 | 04 | 05 |
|---|---|---|---|---|---|
| attr_o | 6 | 6 | 10 | 6 | 5 |
| sinc_o | 8 | 7 | 10 | 8 | 9 |
| intel_o | 8 | 10 | 10 | 6 | 9 |
| fun_o | 8 | 7 | 10 | 8 | 7 |
| amb_o | 7 | 6 | 10 | 10 | 9 |
| shar_o | 6 | 5 | 10 | 10 | 5 |
| field_cd | 1 | 1 | 1 | 1 | 1 |
| race | 2 | 2 | 2 | 2 | 2 |
| goal | 2 | 2 | 2 | 2 | 2 |
| date | 4 | 4 | 4 | 4 | 4 |
| go_out | 1 | 1 | 1 | 1 | 1 |
| career_c | 1 | 1 | 1 | 1 | 1 |
| sports | 7 | 7 | 7 | 7 | 7 |
| tvsports | 4 | 4 | 4 | 4 | 4 |
| exercise | 7 | 7 | 7 | 7 | 7 |
| dining | 7 | 7 | 7 | 7 | 7 |
| museums | 6 | 6 | 6 | 6 | 6 |
| art | 8 | 8 | 8 | 8 | 8 |
| hiking | 6 | 6 | 6 | 6 | 6 |
| gaming | 6 | 6 | 6 | 6 | 6 |
| clubbing | 8 | 8 | 8 | 8 | 8 |
| reading | 6 | 6 | 6 | 6 | 6 |
| tv | 8 | 8 | 8 | 8 | 8 |
| theater | 6 | 6 | 6 | 6 | 6 |
| movies | 6 | 6 | 6 | 6 | 6 |
| concerts | 3 | 3 | 3 | 3 | 3 |
| music | 7 | 7 | 7 | 7 | 7 |
| shopping | 8 | 8 | 8 | 8 | 8 |
| yoga | 3 | 3 | 3 | 3 | 3 |
We followed the classification approach explained in class.
Let’s follow these steps.
We have three data samples: estimation_data (e.g. 80% of the data in our case), validation_data (e.g. 10% of the data) and test_data (e.g. the remaining 10% of the data).
In our case we use 1076 observations in the estimation data, 135 in the validation data, and 135 in the test data.
Our dependent variable is dec_o. It states whether a given subject was selected by the partner. In our data the number of 0/1’s in our estimation sample is as follows.
| | Class 1 | Class 0 |
|---|---|---|
| # of Observations | 547 | 529 |
while in the validation sample they are:
| | Class 1 | Class 0 |
|---|---|---|
| # of Observations | 77 | 58 |
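The three-way split and the class counts described above can be sketched as follows. This is a minimal Python illustration, not the code used for the report: the function names, the random seed, and the rule of giving half of the remaining rows to validation are assumptions made for the example.

```python
import random
from collections import Counter


def split_indices(n_rows, est_share=0.8, seed=0):
    """Shuffle row indices and cut them into estimation/validation/test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)       # reproducible shuffle (seed is arbitrary)
    n_est = int(est_share * n_rows)            # e.g. 80% of the rows for estimation
    n_val = (n_rows - n_est) // 2              # half of the remainder for validation
    estimation = indices[:n_est]
    validation = indices[n_est:n_est + n_val]
    test = indices[n_est + n_val:]             # whatever is left is the test sample
    return estimation, validation, test


def class_counts(labels):
    """Count how many observations fall in class 1 and in class 0."""
    counts = Counter(labels)
    return {"Class 1": counts[1], "Class 0": counts[0]}


estimation, validation, test = split_indices(1346)
print(len(estimation), len(validation), len(test))
```

With 1346 rows this rule yields samples of 1076, 135 and 135 observations, the sizes quoted above; `class_counts` can then be applied to the dec_o values of each sample.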
Below are the statistics of our independent variables across the two classes. Class 1, “selected”:
| | min | 25 percent | median | mean | 75 percent | max | std |
|---|---|---|---|---|---|---|---|
| attr_o | 1 | 6.0 | 7 | 7.37 | 8.0 | 10 | 1.51 |
| sinc_o | 0 | 7.0 | 8 | 7.55 | 8.5 | 10 | 1.56 |
| intel_o | 3 | 7.0 | 8 | 7.65 | 8.0 | 10 | 1.32 |
| fun_o | 0 | 6.0 | 7 | 7.34 | 8.0 | 10 | 1.62 |
| amb_o | 3 | 6.0 | 7 | 7.14 | 8.0 | 10 | 1.59 |
| shar_o | 0 | 5.0 | 7 | 6.48 | 8.0 | 10 | 1.89 |
| field_cd | 1 | 3.0 | 8 | 6.71 | 10.0 | 16 | 4.41 |
| race | 1 | 2.0 | 2 | 2.66 | 4.0 | 6 | 1.28 |
| goal | 1 | 1.0 | 2 | 1.95 | 2.0 | 6 | 1.28 |
| date | 1 | 4.0 | 5 | 4.92 | 6.0 | 7 | 1.61 |
| go_out | 1 | 1.0 | 2 | 2.04 | 3.0 | 6 | 0.98 |
| career_c | 1 | 2.0 | 4 | 4.99 | 7.0 | 15 | 3.48 |
| sports | 1 | 4.0 | 7 | 6.04 | 8.0 | 10 | 2.78 |
| tvsports | 1 | 2.0 | 4 | 4.50 | 7.0 | 10 | 2.95 |
| exercise | 1 | 5.0 | 7 | 6.63 | 9.0 | 10 | 2.50 |
| dining | 4 | 7.0 | 8 | 7.91 | 9.0 | 10 | 1.79 |
| museums | 2 | 7.0 | 7 | 7.24 | 9.0 | 10 | 1.87 |
| art | 2 | 6.0 | 7 | 6.94 | 8.0 | 10 | 1.96 |
| hiking | 0 | 4.0 | 7 | 5.97 | 8.0 | 10 | 2.55 |
| gaming | 1 | 1.0 | 2 | 3.30 | 5.0 | 14 | 2.59 |
| clubbing | 1 | 4.0 | 6 | 5.79 | 8.0 | 10 | 2.63 |
| reading | 2 | 7.0 | 8 | 7.84 | 9.0 | 10 | 1.96 |
| tv | 1 | 2.0 | 6 | 5.15 | 7.0 | 10 | 2.83 |
| theater | 1 | 5.0 | 8 | 6.98 | 9.0 | 10 | 2.38 |
| movies | 2 | 7.0 | 8 | 8.03 | 10.0 | 10 | 1.86 |
| concerts | 1 | 6.0 | 7 | 7.18 | 9.0 | 10 | 2.15 |
| music | 1 | 7.0 | 8 | 8.00 | 10.0 | 10 | 1.81 |
| shopping | 1 | 4.5 | 7 | 6.30 | 8.0 | 10 | 2.70 |
| yoga | 1 | 2.0 | 4 | 4.54 | 7.0 | 10 | 2.84 |
and class 0, “not selected”:
| | min | 25 percent | median | mean | 75 percent | max | std |
|---|---|---|---|---|---|---|---|
| attr_o | 0 | 4 | 6 | 5.55 | 7 | 10 | 1.85 |
| sinc_o | 0 | 6 | 7 | 6.84 | 8 | 10 | 1.85 |
| intel_o | 0 | 6 | 7 | 7.04 | 8 | 10 | 1.65 |
| fun_o | 0 | 5 | 6 | 5.82 | 7 | 11 | 1.99 |
| amb_o | 0 | 5 | 6 | 6.55 | 8 | 10 | 1.83 |
| shar_o | 0 | 3 | 5 | 4.85 | 6 | 10 | 2.05 |
| field_cd | 1 | 3 | 8 | 6.56 | 10 | 16 | 4.13 |
| race | 1 | 2 | 2 | 2.64 | 4 | 6 | 1.15 |
| goal | 1 | 1 | 2 | 2.10 | 2 | 6 | 1.38 |
| date | 1 | 4 | 6 | 5.33 | 7 | 7 | 1.48 |
| go_out | 1 | 1 | 2 | 2.23 | 3 | 6 | 1.23 |
| career_c | 1 | 2 | 3 | 4.58 | 7 | 15 | 3.20 |
| sports | 1 | 3 | 6 | 5.76 | 8 | 10 | 2.73 |
| tvsports | 1 | 2 | 4 | 4.44 | 7 | 10 | 2.87 |
| exercise | 1 | 5 | 7 | 6.68 | 8 | 10 | 2.39 |
| dining | 4 | 7 | 8 | 8.02 | 9 | 10 | 1.63 |
| museums | 2 | 6 | 8 | 7.37 | 9 | 10 | 1.90 |
| art | 2 | 5 | 8 | 7.08 | 8 | 10 | 2.02 |
| hiking | 0 | 3 | 6 | 5.52 | 7 | 10 | 2.67 |
| gaming | 1 | 1 | 3 | 3.36 | 5 | 14 | 2.49 |
| clubbing | 1 | 4 | 6 | 5.76 | 8 | 10 | 2.27 |
| reading | 2 | 7 | 8 | 8.07 | 9 | 10 | 1.68 |
| tv | 1 | 3 | 6 | 5.67 | 8 | 10 | 2.86 |
| theater | 1 | | | 7.48 | 9 | 10 | 2.19 |
| movies | 2 | 8 | 9 | 8.34 | 10 | 10 | 1.71 |
| concerts | 1 | | | 7.26 | 9 | 10 | 2.15 |
| music | 1 | 7 | 8 | 7.85 | 9 | 10 | 2.03 |
| shopping | 1 | | | 6.52 | 9 | 10 | 2.53 |
| yoga | 1 | 2 | 4 | 4.56 | 7 | 10 | 2.93 |
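The columns of the two tables above (min, quartiles, median, mean, max, standard deviation) can be computed per variable and per class along these lines. This is a hedged Python sketch using only the standard library; the function name `summarize` and the choice of the "inclusive" quantile method are assumptions, and the report's own software may use a slightly different quantile definition.

```python
import statistics


def summarize(values):
    """Return the summary statistics shown in the tables for one variable."""
    # Quartiles with linear interpolation between data points
    # (assumption: the "inclusive" method).
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "25 percent": q1,
        "median": median,
        "mean": round(statistics.mean(values), 2),
        "75 percent": q3,
        "max": max(values),
        "std": round(statistics.stdev(values), 2),  # sample standard deviation
    }


# Hypothetical ratings for one variable within one class.
print(summarize([1, 2, 3, 4, 5]))
```

To reproduce a table, `summarize` would be called once per variable on the rows with dec_o equal to 1, and once more on the rows with dec_o equal to 0.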
A simple visualization of these values is presented below using box plots. These visually indicate simple summary statistics of an independent variable (e.g. mean, median, top and bottom quantiles, min, max, etc). For example, for class 0:
and class 1:
For our assignment, we used three classification methods: logistic regression, classification and regression trees (CART) and machine learning (i.e. random forests).
Running a basic CART model with complexity control cp = 0.01 leads to the following tree:
The key decision criteria can be explained by the following table.
| Attribute | Name |
|---|---|
| IV1 | attr_o |
| IV4 | fun_o |
| IV6 | shar_o |
| IV2 | sinc_o |
| IV3 | intel_o |
| IV5 | amb_o |
| IV27 | music |
| IV12 | career_c |
| IV26 | concerts |
| IV7 | field_cd |
| IV25 | movies |
| IV16 | dining |
| IV22 | reading |
| IV15 | exercise |
| IV21 | clubbing |
| IV24 | theater |
For example, this is how the tree would look if we set cp = 0.005:
| Attribute | Name |
|---|---|
| IV1 | attr_o |
| IV4 | fun_o |
| IV6 | shar_o |
| IV2 | sinc_o |
| IV3 | intel_o |
| IV5 | amb_o |
| IV7 | field_cd |
| IV16 | dining |
| IV12 | career_c |
| IV15 | exercise |
| IV27 | music |
| IV26 | concerts |
| IV23 | tv |
| IV18 | art |
| IV28 | shopping |
| IV13 | sports |
| IV17 | museums |
| IV25 | movies |
| IV22 | reading |
| IV19 | hiking |
| IV29 | yoga |
| IV24 | theater |
| IV21 | clubbing |
| IV8 | race |
| IV9 | goal |
| IV11 | go_out |
Below we present the probability that our validation data belong to class 1. For the first few validation observations, using the first CART above, these are:
| | Actual Class | Probability of Class 1 |
|---|---|---|
| Obs 1 | 1 | 0.65 |
| Obs 2 | 1 | 0.80 |
| Obs 3 | 1 | 0.80 |
| Obs 4 | 1 | 0.28 |
| Obs 5 | 1 | 0.31 |
Logistic regression is a method similar to linear regression except that the dependent variable can be discrete (e.g. 0 or 1). Linear logistic regression estimates the coefficients of a linear model using the selected independent variables while optimizing a classification criterion. For example, these are the logistic regression parameters for our data:
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -2.4 | 1.0 | -2.5 | 0.0 |
| attr_o | 0.6 | 0.1 | 9.9 | 0.0 |
| sinc_o | -0.2 | 0.1 | -2.9 | 0.0 |
| intel_o | 0.0 | 0.1 | 0.3 | 0.8 |
| fun_o | 0.2 | 0.1 | 4.1 | 0.0 |
| amb_o | -0.2 | 0.1 | -3.1 | 0.0 |
| shar_o | 0.3 | 0.0 | 6.1 | 0.0 |
| field_cd | 0.0 | 0.0 | -1.7 | 0.1 |
| race | 0.1 | 0.1 | 1.3 | 0.2 |
| goal | 0.0 | 0.1 | -0.3 | 0.8 |
| date | -0.1 | 0.1 | -2.1 | 0.0 |
| go_out | 0.1 | 0.1 | 1.5 | 0.1 |
| career_c | 0.0 | 0.0 | 1.6 | 0.1 |
| sports | 0.0 | 0.0 | -0.1 | 0.9 |
| tvsports | 0.0 | 0.0 | 0.5 | 0.6 |
| exercise | -0.1 | 0.0 | -2.4 | 0.0 |
| dining | | | -0.1 | 0.9 |
| museums | 0.1 | 0.1 | 0.5 | 0.6 |
| art | | | | |
| hiking | 0.0 | 0.0 | | |
| gaming | 0.0 | 0.0 | 0.7 | 0.5 |
| clubbing | -0.1 | 0.0 | -1.8 | |
| reading | -0.1 | 0.1 | -1.8 | 0.1 |
| tv | -0.1 | 0.0 | -1.3 | 0.2 |
| theater | -0.1 | 0.1 | -2.1 | 0.0 |
| movies | 0.0 | 0.1 | -0.5 | 0.6 |
| concerts | 0.0 | 0.1 | -0.3 | 0.8 |
| music | 0.0 | 0.1 | 0.6 | 0.5 |
| shopping | 0.0 | 0.0 | 0.1 | 0.9 |
| yoga | 0.0 | 0.0 | -1.0 | 0.3 |
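To make the role of the estimated coefficients concrete, the sketch below turns a set of coefficients into a predicted probability of class 1 through the logistic function. It is a Python illustration only: the function name is an assumption, and the coefficient values are illustrative rounded figures for three of the variables, so the resulting probabilities are not those of the fitted model.

```python
import math


def logistic_probability(intercept, coefficients, features):
    """Probability of class 1: the logistic function applied to the linear score."""
    score = intercept + sum(coefficients[name] * features[name] for name in coefficients)
    return 1.0 / (1.0 + math.exp(-score))


# Illustrative rounded values for a few variables (not the full fitted model).
intercept = -2.4
coefficients = {"attr_o": 0.6, "fun_o": 0.2, "shar_o": 0.3}

low = logistic_probability(intercept, coefficients, {"attr_o": 4, "fun_o": 5, "shar_o": 3})
high = logistic_probability(intercept, coefficients, {"attr_o": 9, "fun_o": 8, "shar_o": 7})
print(round(low, 2), round(high, 2))
```

A positive coefficient, such as the one on attr_o, raises the predicted probability as the rating increases, which is why the second profile scores higher than the first.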
Random forests is the last method that we used. Below is an overview of the key success factors when it comes to speed dating decision making.
The table below shows the key drivers of the classification according to each of the methods used.
| | CART 1 | CART 2 | Logistic Regr. | Random Forests - mean decrease in accuracy |
|---|---|---|---|---|
| attr_o | 1.00 | 1.00 | 1.00 | 1.00 |
| sinc_o | -0.30 | -0.33 | -0.29 | 0.20 |
| intel_o | 0.30 | 0.31 | 0.03 | 0.17 |
| fun_o | 0.54 | 0.54 | 0.41 | 0.59 |
| amb_o | -0.27 | -0.31 | -0.31 | 0.11 |
| shar_o | 0.30 | 0.35 | 0.62 | 0.65 |
| field_cd | -0.02 | -0.08 | -0.17 | 0.17 |
| race | 0.00 | 0.01 | 0.13 | 0.09 |
| goal | 0.00 | -0.01 | -0.03 | 0.09 |
| date | 0.00 | 0.00 | -0.21 | 0.18 |
| go_out | 0.00 | 0.00 | 0.15 | 0.09 |
| career_c | 0.04 | 0.08 | 0.16 | 0.16 |
| sports | 0.00 | -0.03 | -0.01 | 0.14 |
| tvsports | 0.00 | 0.00 | 0.05 | 0.10 |
| exercise | -0.01 | -0.08 | -0.24 | 0.19 |
| dining | -0.02 | -0.08 | -0.01 | 0.13 |
| museums | 0.00 | 0.03 | 0.05 | 0.13 |
| art | 0.00 | 0.04 | 0.09 | 0.13 |
| hiking | 0.00 | 0.02 | 0.08 | 0.16 |
| gaming | 0.00 | 0.00 | 0.07 | 0.11 |
| clubbing | -0.01 | -0.01 | -0.18 | 0.10 |
| reading | -0.01 | -0.02 | -0.18 | 0.14 |
| tv | 0.00 | -0.04 | -0.13 | 0.20 |
| theater | -0.01 | -0.01 | -0.21 | 0.16 |
| movies | -0.02 | -0.02 | -0.05 | 0.15 |
| concerts | -0.03 | -0.05 | -0.03 | 0.14 |
| music | 0.06 | 0.06 | 0.06 | 0.15 |
| shopping | 0.00 | 0.04 | 0.01 | 0.16 |
| yoga | 0.00 | -0.02 | -0.10 | 0.15 |
In general, we do not see very significant differences in the key drivers across the methods used, which makes sense.
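Each column of the drivers table is expressed relative to the most important variable, so that attr_o equals 1.00. One way to obtain such a scale from raw importance scores is to divide every score by the largest absolute score. The Python sketch below illustrates this under the assumption that the raw scores are, for example, the z values of a logistic regression; the function name and the three sample scores are chosen for illustration only.

```python
def scale_to_max(scores):
    """Divide every score by the largest absolute score, keeping the sign."""
    largest = max(abs(value) for value in scores.values())
    return {name: round(value / largest, 2) for name, value in scores.items()}


# Hypothetical raw scores for three variables (e.g. z values).
raw_scores = {"attr_o": 9.9, "sinc_o": -2.9, "fun_o": 4.1}
print(scale_to_max(raw_scores))
```

Scaling this way keeps the sign of each driver and makes the columns comparable across methods, even though the underlying importance measures have different units.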
Below is the percentage of observations that were correctly classified (the predicted class is the same as the actual class) for the validation data, using a probability threshold of 50%:
| | Hit Ratio |
|---|---|
| First CART | 65.18519 |
| Second CART | 68.14815 |
| Logistic Regression | 74.07407 |
| Random Forests | 67.40741 |
while for the estimation data the hit rates are:
| | Hit Ratio |
|---|---|
| First CART | 74.34944 |
| Second CART | 79.55390 |
| Logistic Regression | 74.72119 |
| Random Forests | 99.07063 |
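The hit ratios reported above follow from comparing thresholded probabilities with the actual classes. A minimal Python sketch of that calculation is given here; the function name and the four sample observations are hypothetical.

```python
def hit_ratio(actual, probabilities, threshold=0.5):
    """Percentage of observations whose predicted class matches the actual class."""
    # An observation is predicted as class 1 when its probability exceeds the threshold.
    predicted = [1 if p > threshold else 0 for p in probabilities]
    hits = sum(1 for a, y in zip(actual, predicted) if a == y)
    return 100.0 * hits / len(actual)


# Hypothetical example: two of the four observations are classified correctly.
print(hit_ratio([1, 1, 0, 0], [0.65, 0.28, 0.20, 0.80]))
```

For instance, 100 correct predictions out of 135 observations correspond to a hit ratio of about 74.07%.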
A simple benchmark against which to compare the performance of a classification model is the Maximum Chance Criterion. This measures the proportion of the class with the largest size. For our validation data the largest group is the subjects who were selected by their partner: 77 out of 135 people. Clearly, without doing any discriminant analysis, if we classified all individuals into the largest group, we would get a hit rate of 57.04% without doing any work.
In our case all the methods that were used exceed this benchmark.
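The Maximum Chance Criterion is simply the share of the largest class, so it can be computed directly from the class labels. The Python sketch below uses class counts of 77 and 58, as in the validation sample table; the function name is an assumption.

```python
from collections import Counter


def maximum_chance_criterion(labels):
    """Hit rate (in %) obtained by assigning every observation to the largest class."""
    counts = Counter(labels)
    return 100.0 * max(counts.values()) / len(labels)


# 77 observations of class 1 and 58 of class 0.
validation_labels = [1] * 77 + [0] * 58
print(round(maximum_chance_criterion(validation_labels), 2))
```

Any classifier worth using should reach a validation hit ratio above this value.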
The confusion matrix shows for each class the number (or percentage) of the data that are correctly classified for that class. For example for the method above with the highest hit rate in the validation data (among logistic regression, 2 CART models and random forests), the confusion matrix for the validation data is:
| | Predicted 1 | Predicted 0 |
|---|---|---|
| Actual 1 | 89.61 | 10.39 |
| Actual 0 | 53.45 | 46.55 |
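A confusion matrix expressed in row percentages, as above, can be built by counting predictions within each actual class. The following Python sketch shows one way to do it; the function name is an assumption, and the counts in the example are hypothetical values chosen only to illustrate the calculation for a sample with 77 and 58 observations per class.

```python
def confusion_matrix_percent(actual, predicted):
    """For each actual class, the percentage of observations predicted as 1 and as 0."""
    matrix = {}
    for cls in (1, 0):
        total = sum(1 for a in actual if a == cls)
        row = {}
        for pred in (1, 0):
            count = sum(1 for a, p in zip(actual, predicted) if a == cls and p == pred)
            row[f"Predicted {pred}"] = round(100.0 * count / total, 2) if total else 0.0
        matrix[f"Actual {cls}"] = row
    return matrix


# Hypothetical counts: 69 of 77 class-1 and 27 of 58 class-0 observations are correct.
actual = [1] * 77 + [0] * 58
predicted = [1] * 69 + [0] * 8 + [1] * 31 + [0] * 27
print(confusion_matrix_percent(actual, predicted))
```

Each row sums to 100%, so the diagonal entries show how well each class is recognized separately, which a single hit ratio hides.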
The ROC curves for the validation data for all four methods are shown below:
The Lift curves for the validation data for our four classifiers are the following:
Below are the hit ratios for all four methods based on the test dataset:
| | Hit Ratio |
|---|---|
| First CART | 74.81481 |
| Second CART | 69.62963 |
| Logistic Regression | 68.14815 |
| Random Forests | 72.59259 |
The confusion matrix for the model with the best validation data hit ratio above:
| | Predicted 1 | Predicted 0 |
|---|---|---|
| Actual 1 | 67 | 33 |
| Actual 0 | 69 | 31 |
ROC curves for the test data:
Lift Curves for the test data:
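For reference, the points of an ROC curve such as those mentioned above can be computed from the actual classes and the predicted probabilities by sweeping the classification threshold. The Python sketch below is a minimal illustration that assumes both classes are present in the data; the function names and the sample data are hypothetical.

```python
def roc_points(actual, scores):
    """(false positive rate, true positive rate) pairs, one per distinct threshold."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; each step classifies more observations as 1.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 1)
        fp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Area under the ROC curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical scores that separate the two classes perfectly.
points = roc_points([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])
print(points, area_under_curve(points))
```

An area close to 1 indicates a classifier that ranks selected subjects above non-selected ones, while an area of 0.5 corresponds to random guessing.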